
Telco Customer Churn Project: Summary Paper

1 Chapter 1: Introduction of the Company

1.1 Basic Information

Telco Systems is a company that has been designing and developing high-performance network communications for over 40 years. It is a global leader in telecommunications, providing services that include 5G internet, network slicing, and more. None of us in the group had previously analyzed a company in the telecommunications field.

1.2 Reasons for Choosing Telco Systems

We chose this company because Telco provides a large amount of home phone and internet service, two things that are unavoidable in contemporary society. People carry mobile phones wherever they go and have adapted to life with the internet; when they stay at home, they spend most of their time online, whether working or playing. We can say that people cannot live without the internet nowadays, so a company providing these services is crucial. Telco is a good company, but it also has drawbacks: not all users are satisfied with its service, so we want to know where Telco can improve. We care about what customers say about Telco and what they dislike. Our goal is to help the company identify specific problems and avoid unnecessary customer churn. By analyzing the company, we ask which factors Telco can improve to give customers a better experience. The best way to learn about the customer experience is through customer surveys, so we chose a dataset based on a customer survey.

2 Chapter 2: Description of Dataset

2.1 About Dataset

The Telco customer churn data contains information about the 7043 customers to whom Telco provided home phone and internet services in California at the end of 2017 Quarter 3. The data include customers' basic information and indicate which customers have left, stayed, or signed up for service.
Studying such data can help the company identify the characteristics of lost customers, spot customers who are likely to leave soon, and develop appropriate strategies to retain them.
The dataset is WA_Fn-UseC_-Telco-Customer-Churn.csv. Before analyzing it, we researched what churn represents and why avoiding churn matters in business. Churn in this dataset marks lost customers. One might ask why a company should spend time retaining current customers or reducing losses. In fact, acquiring a new customer is much harder than retaining an existing one: a company pays far less to retain existing customers than it would spend on advertising to attract new ones, and retention is also a sound strategy for maintaining a good reputation. It is therefore crucial to identify the current problems and fix them.

2.2 Variables

str(customer)
## 'data.frame':    7032 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
##   ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...
  • gender: Female or Male
  • SeniorCitizen: customer is a senior citizen or not (Yes, No)
  • Partner: customer has a partner or not (Yes, No)
  • Dependents: customer has dependents or not (Yes, No)
  • tenure: number of months the customer has stayed with the company
  • PhoneService: customer has a phone service or not (Yes, No)
  • MultipleLines: customer has multiple lines or not (Yes, No, No phone service)
  • InternetService: customer’s internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity: customer has online security or not (Yes, No, No internet service)
  • OnlineBackup: customer has online backup or not (Yes, No, No internet service)
  • DeviceProtection: customer has device protection or not (Yes, No, No internet service)
  • TechSupport: customer has tech support or not (Yes, No, No internet service)
  • StreamingTV: customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies: customer has streaming movies or not (Yes, No, No internet service)
  • Contract: contract term of the customer (Month-to-month, One year, Two year)
  • PaperlessBilling: customer has paperless billing or not (Yes, No)
  • PaymentMethod: Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)
  • MonthlyCharges: amount charged monthly
  • TotalCharges: total amount charged
  • Churn: customer churned or not (Yes or No)

For our exploratory data analysis, we did some preprocessing, including cleaning and type conversion: we dropped the rows with "NA" values to simplify the analysis, and we converted every variable except tenure, MonthlyCharges, and TotalCharges into a factor.
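A minimal sketch of that preprocessing in base R (the toy data frame below is illustrative with made-up values; in the report the same steps were applied to the full CSV):

```r
# Toy frame mimicking the raw columns; one row carries an NA in
# TotalCharges, like the 11 rows dropped from the real file.
toy_raw <- data.frame(
  customerID     = c("0001-A", "0002-B", "0003-C"),
  gender         = c("Female", "Male", "Male"),
  tenure         = c(1L, 34L, 2L),
  MonthlyCharges = c(29.85, 56.95, 53.85),
  TotalCharges   = c(29.85, NA, 107.70),
  Churn          = c("No", "No", "Yes"),
  stringsAsFactors = FALSE
)

toy_clean <- na.omit(toy_raw)            # drop rows with NA values
numeric_cols <- c("tenure", "MonthlyCharges", "TotalCharges")
for (col in setdiff(names(toy_clean), c("customerID", numeric_cols))) {
  toy_clean[[col]] <- factor(toy_clean[[col]])  # everything else -> factor
}
str(toy_clean)   # 2 obs.; gender and Churn are now factors
```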

3 Chapter 3: Categorical Variables EDA

A brief overview of the dataset shows that 1869 customers left the company and 5163 did not. Below we call the customers who left "churned" customers.

3.1 Churn vs Not Churn

3.1.1 What do churned customers look like?

Here we have picked out a few factors from the summary that have a significant difference between factor levels.
Churned Customer Attributes

  SeniorCitizen   Partner     Dependents   PhoneService   InternetService     PaperlessBilling
  No : 1393       No : 1200   No : 1543    No :  170      DSL        :  459   No :  469
  Yes:  476       Yes:  669   Yes:  326    Yes: 1699      Fiber optic: 1297   Yes: 1400
                                                          No         :  113

From the table we see that 476 of the 1869 churned customers are senior citizens; that is a minority in absolute terms, but it is high relative to the small share of seniors in the customer base, so age may play a role in leaving. Most churned customers have no dependents, which suggests they may be older and have their own considerations when choosing a phone company. We also see that most churned customers have no partner but have signed up for phone service and paperless billing. However, we cannot assume a direct cause at this point, so we return to these factors later.
Customers who signed up for Fiber Optic internet service quit the most. This may be because they are not satisfied with Fiber Optic, but this speculation only holds if the total numbers of Fiber Optic and DSL customers are roughly comparable. From the summary, there are 2416 DSL customers and 3096 Fiber Optic customers, so the comparison is fair and the speculation stands.

Then we want to take a deeper look into churned customers who have signed up for Internet Services.
Attributes of Churned Customers with Internet Services

  OnlineSecurity              OnlineBackup                DeviceProtection            TechSupport
  No                 : 1461   No                 : 1233   No                 : 1211   No                 : 1446
  No internet service:    0   No internet service:    0   No internet service:    0   No internet service:    0
  Yes                :  295   Yes                :  523   Yes                :  545   Yes                :  310

These four internet service add-ons show a significant pattern: most churned customers did not sign up for any of them.

Now let’s look at the contract and payment method.
Churned Customer Attributes (continued)

  Contract                PaymentMethod
  Month-to-month: 1655    Bank transfer (automatic):  258
  One year      :  166    Credit card (automatic)  :  232
  Two year      :   48    Electronic check         : 1071
                          Mailed check             :  308

Most churned customers have a short-term (month-to-month) contract and pay their bills by electronic check.

3.1.2 Contrast Churned with not Churned

Since the numbers of churned and not-churned customers differ, we cannot directly compare raw counts for each attribute; instead, we compare percentages.
Here we see the factors that have a significant impact on whether a customer churns. As mentioned earlier, these factors also characterize most churned customers.
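For example, the percentages behind these bar charts can be computed directly with prop.table; the counts below are hypothetical, chosen only to illustrate the calculation:

```r
# Hypothetical 2x2 counts: rows = SeniorCitizen, columns = Churn
tab <- matrix(c(4500, 1400, 650, 480), nrow = 2,
              dimnames = list(SeniorCitizen = c("No", "Yes"),
                              Churn         = c("No", "Yes")))
# margin = 1 normalizes each row, giving the churn rate within each group
rates <- prop.table(tab, margin = 1)
round(rates[, "Yes"], 3)   # churn rate for non-seniors vs. seniors
```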

senior <- ggplot(customer, aes(x=SeniorCitizen, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "Not senior", "Yes" = "Senior")) +
                   labs(title="SeniorCitizen", x="", y="Percentage") +
                   theme(legend.position="top")
partner <- ggplot(customer, aes(x=Partner, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "Have no partner", "Yes" = "Have a partner")) +
                   labs(title="Partner", x="", y="Percentage") +
                   theme(legend.position="top")
dependent <- ggplot(customer, aes(x=Dependents, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "Have no dependents", "Yes" = "Have dependents")) +
                   labs(title="Dependent", x="", y="Percentage") +
                   theme(legend.position="top")
internet_service <- ggplot(customer, aes(x=InternetService, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Internet services", x="Internet company", y="Percentage") +
                   theme(legend.position="top")
online_security <- ggplot(haveIntAll, aes(x=OnlineSecurity, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Online security", x="", y="Percentage") +
                   theme(legend.position="top")
online_backup <- ggplot(haveIntAll, aes(x=OnlineBackup, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Online backup", x="", y="Percentage") +
                   theme(legend.position="top")
device_protection <- ggplot(haveIntAll, aes(x=DeviceProtection, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Device protection", x="", y="Percentage") +
                   theme(legend.position="top")
tech_support <- ggplot(haveIntAll, aes(x=TechSupport, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Tech support", x="", y="Percentage") +
                   theme(legend.position="top")
contract <- ggplot(customer, aes(x=Contract, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Contract", x="Method", y="Percentage") +
                   theme(legend.position="top")
paperless_billing <- ggplot(customer, aes(x=PaperlessBilling, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "No paperless billing", "Yes" = "Have paperless billing")) +
                   labs(title="Paperless billing", x="", y="Percentage") +
                   theme(legend.position="top")
payment_method <- ggplot(customer, aes(x=PaymentMethod, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("Bank transfer (automatic)" = "bank transfer", "Electronic check" = "E-check", "Credit card (automatic)" = "credit card")) +
                   labs(title="Payment method", x="Method", y="Percentage") +
                   theme(legend.position="top")
ggarrange(senior, partner, dependent, paperless_billing)

ggarrange(internet_service, online_security, online_backup, device_protection, tech_support)

ggarrange(contract, payment_method, nrow=2, ncol=1)

Recall the three factors we were unsure about earlier (partner, phone service, and paperless billing); they also show differences between churned and not-churned customers. Let us test whether each of them has a significant impact on churn.

3.2 Chi-square Tests

We use the \(\chi^2\) test to check whether two categorical variables are independent, based on their contingency table.
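To illustrate what the test computes: under independence, the expected count in each cell is E = (row total × column total) / N, and the statistic is the sum of (O − E)²/E over all cells. A toy 2×2 check (without Yates' correction, to match the hand formula; the counts are made up):

```r
tab <- matrix(c(40, 60, 70, 30), nrow = 2)          # toy contingency table
N <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / N   # E_ij = r_i * c_j / N
stat <- sum((tab - expected)^2 / expected)
# agrees with chisq.test when continuity correction is turned off
all.equal(stat, unname(chisq.test(tab, correct = FALSE)$statistic))
```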

3.2.1 Test Partner

churn_vs_partner = table(customer$Churn, customer$Partner)

Are they independent?
- \(H_0\): churn and partner are independent.
- \(H_1\): they are not independent.

chi_test1 = chisq.test(churn_vs_partner)
chi_test1
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  churn_vs_partner
## X-squared = 158, df = 1, p-value <2e-16

Since the p-value = 3.97e-36 < 0.05 (the printout truncates it to < 2e-16; the exact value is available via chi_test1$p.value), we reject the null hypothesis: partner has a significant impact on churn.

3.2.2 Test PhoneService

churn_vs_phone = table(customer$Churn, customer$PhoneService)
churn_vs_phone

Are they independent?
- \(H_0\): churn and phone service are independent.
- \(H_1\): they are not independent.

chi_test2 = chisq.test(churn_vs_phone)
chi_test2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  churn_vs_phone
## X-squared = 0.9, df = 1, p-value = 0.3

Since the p-value = 0.35 > 0.05, we fail to reject the null hypothesis: phone service does not have a significant impact on churn.

3.2.3 Test PaperlessBilling

churn_vs_bill = table(customer$Churn, customer$PaperlessBilling)
churn_vs_bill

Are they independent?
- \(H_0\): churn and paperless bill are independent.
- \(H_1\): they are not independent.

chi_test3 = chisq.test(churn_vs_bill)
chi_test3
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  churn_vs_bill
## X-squared = 257, df = 1, p-value <2e-16

Since the p-value = 8.24e-58 < 0.05, we reject the null hypothesis: paperless billing has a significant impact on churn.

4 Chapter 4: Continuous Variables EDA

A brief overview of the dataset shows three continuous variables: tenure, MonthlyCharges, and TotalCharges.

4.1 KDE plot of continuous variables

A KDE (kernel density estimation) plot depicts the estimated probability density function of a continuous variable. It lets us observe the distribution of a sample easily, and because densities integrate to one, comparisons between groups are not distorted by differences in sample size.

tenure_kdeplot <- ggplot(data = customer, aes(x = tenure, color = Churn)) +
  geom_density(aes(fill = Churn), alpha = 0.8) +
  scale_fill_manual(values=c("pink3", "steelblue")) +
  labs(title="KDE plot for tenure", x="tenure", y="density") +
  theme(legend.position="top")
tenure_kdeplot

MonthlyCharges_kdeplot <- ggplot(data = customer, aes(x = MonthlyCharges, color = Churn)) +
  geom_density(aes(fill = Churn), alpha = 0.8) +
  scale_fill_manual(values=c("pink3", "steelblue")) +
  labs(title="KDE plot for MonthlyCharges", x="MonthlyCharges", y="density") +
  theme(legend.position="top")
MonthlyCharges_kdeplot

TotalCharges_kdeplot <- ggplot(data = customer, aes(x = TotalCharges, color = Churn)) +
  geom_density(aes(fill = Churn), alpha = 0.8) +
  scale_fill_manual(values=c("pink3", "steelblue")) +
  labs(title="KDE plot for TotalCharges", x="TotalCharges", y="density") +
  theme(legend.position="top")
TotalCharges_kdeplot

The tenure KDE plot shows that customers with lower tenure are more likely to churn, and the MonthlyCharges plot shows that customers with higher monthly charges are also more likely to churn. In the TotalCharges plot, churned and retained customers have very similar distributions. Taken together, the three KDE plots suggest that tenure may be negatively correlated with the churn rate, MonthlyCharges may be positively correlated with it, and TotalCharges may contribute only a little.

4.2 Logistic regression

Logistic regression is the appropriate regression analysis for predicting a binary outcome (the dependent variable) from a set of independent variables.

To verify the conclusions drawn from the KDE plots numerically, we fit logistic regression models that classify churn using different feature sets.

We can see this in the likelihood-ratio (ANOVA) test results below.

lm1 <- glm(Churn~tenure, family = binomial(link = "logit"), data = customer)
lm2 <- glm(Churn~tenure + MonthlyCharges, family = binomial(link = "logit"), data = customer)
lm3 <- glm(Churn~tenure + MonthlyCharges + TotalCharges, family = binomial(link = "logit"), data = customer)
anovat <- anova(lm1,lm2,lm3, test="LRT")
anovat
## Analysis of Deviance Table
## 
## Model 1: Churn ~ tenure
## Model 2: Churn ~ tenure + MonthlyCharges
## Model 3: Churn ~ tenure + MonthlyCharges + TotalCharges
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1      7030       7176                         
## 2      7029       6382  1      794   <2e-16 ***
## 3      7028       6376  1        6    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 2 is significantly better than model 1. Model 3, however, is not significant at the 1% level, which suggests it offers little improvement over model 2.
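The p-values in this table come from comparing the drop in residual deviance to a χ² distribution with the corresponding degrees of freedom. Using the rounded deviances from the printout, the model-3 row can be approximately reproduced:

```r
# Deviance drop from model 2 to model 3 (rounded values from the table)
dev_drop <- 6382 - 6376                        # = 6, on 1 degree of freedom
p_val <- pchisq(dev_drop, df = 1, lower.tail = FALSE)
round(p_val, 3)   # near the 0.017 printed above; rounding explains the gap
```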

4.3 AUC and ROC Curve

We can use the AUC and ROC curve to compare models 2 and 3. The ROC (receiver operating characteristic) curve is a performance measure for classification problems at various threshold settings, and the AUC (area under the curve) summarizes the degree of separability: it tells how well the model distinguishes between classes. The higher the AUC, the better the model is at predicting the 0 class as 0 and the 1 class as 1; here, the higher the AUC, the better the model distinguishes churned from retained customers.

We can compare the ROC curves of two models.

prob <- predict(lm2,customer, type = c("response"))
customer$prob <- prob
library(pROC)
g <- roc(Churn ~ prob, data = customer)
plot(g, main = "ROC curve of model 2")

auc(customer$Churn, prob)
prob1 <- predict(lm3, customer, type = c("response"))
customer$prob <- prob1
g1 <- roc(Churn ~ prob1, data = customer)
plot(g1, main = "ROC curve of model 3")

auc(customer$Churn, prob1)

The two ROC curves are almost the same. The AUC of model 2 is 0.808, which means that if we randomly choose one churned and one retained customer, the probability that the model ranks the churned customer higher is 0.808. The AUC of model 3 is 0.809.
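This ranking interpretation can be verified directly: the AUC from the rank (Mann-Whitney U) formula equals the fraction of (churned, retained) pairs in which the churned customer receives the higher score. A self-contained check on toy scores (the vectors below are simulated, not our models' predictions):

```r
set.seed(1)
score_churn <- runif(50, 0.3, 1.0)   # simulated scores for churned customers
score_stay  <- runif(80, 0.0, 0.7)   # simulated scores for retained customers

# Fraction of churned/retained pairs ranked correctly (ties count one half)
diffs <- outer(score_churn, score_stay, FUN = "-")
auc_pairs <- mean((diffs > 0) + 0.5 * (diffs == 0))

# Same quantity via the rank (Mann-Whitney U) formula
r <- rank(c(score_churn, score_stay))
auc_rank <- (sum(r[1:50]) - 50 * 51 / 2) / (50 * 80)

all.equal(auc_pairs, auc_rank)   # the two computations agree
```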

Therefore, TotalCharges contributes only a little to the performance of the classification model, and among the three continuous variables we can disregard its influence on the churn rate.

We can see the summary of model 2.

xkabledply(lm2, title="summary of model 2")
summary of model 2

                  Estimate   Std. Error   z value   Pr(>|z|)
  (Intercept)      -1.7909       0.0866     -20.7          0
  tenure           -0.0550       0.0017     -32.5          0
  MonthlyCharges    0.0329       0.0013      25.3          0

Among the three continuous variables, tenure has a negative coefficient and MonthlyCharges has a positive one, meaning that a customer with low tenure and high monthly charges is more likely to churn; TotalCharges is not a significant influence on the churn rate.

5 Chapter 5: EDA on the joint effect of influencing factors

From the exploration of the individual variables above, we know that some of them have a significant effect on customer churn, and the correlation coefficient plot shows that these variables are interrelated. We therefore plot, one by one, the pairs of variables with relatively large correlation coefficients (absolute value greater than 0.4) to explore how they jointly affect the churn rate.

5.1 Simple correlations

Since most of the variables are factors, it makes more sense to check their Spearman correlations.

customerNum = customer
# convert categorical variable as numeric for spearman method
for(i in 2:21){
  # tenure, MonthlyCharges, TotalCharges
  if (!(i %in% c(6, 19, 20))){
    customerNum[,i] = as.numeric(customerNum[,i])
  }
}
#str(customerNum)

# corrplot with spearman method for categorical variables
customercor <- cor(subset(customerNum, select=-c(customerID, prob)), method="spearman")
#customercor
loadPkg("corrplot")
#corrplot.mixed(customercor, tl.pos = "lt", number.cex = .5, tl.cex=0.8)
corrplot(customercor, type="lower", addCoef.col="black", number.cex=0.5, tl.cex=0.7,title="Telco Customer Churn Correlation", mar=c(0,0,1,0))

unloadPkg("corrplot")

A larger circle means a stronger correlation. Churn correlates negatively with Contract and tenure, which means that a customer who stays longer with the company or has a longer contract term is less likely to churn. A customer who signed up for online security or a technical support plan is also less likely to churn. It also makes sense that Contract and TechSupport are positively correlated: most customers who signed up for technical support also have longer contract terms.

5.2 Will Total Charge influence the churn rate with other variables?

library(patchwork)

tc1 <- ggplot(customer,aes(x = MultipleLines , y = TotalCharges , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("MultipleLines") + ylab("TotalCharges")

tc2 <- ggplot(customer,aes(x = Contract , y = TotalCharges , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Contract") + ylab("TotalCharges")

#tc3 <- ggplot(customer,aes(x = tenure , y = TotalCharges , fill = Churn)) +
#  geom_violin(alpha = 0.5, aes(linetype=NA)) +
#  xlab("Tenure") + ylab("TotalCharges")

tc3 <- ggplot(customer, aes(x=tenure, y=TotalCharges, color=Churn)) +
  geom_point(size=0.6,alpha=0.4)

tc3/ (tc1+tc2) + plot_annotation(title = 'TotalCharges plot')

First, tenure and TotalCharges show a positive correlation. The churn rate is highest when tenure is below about 20 months; at later stages the churned samples are more dispersed.

Second, for MultipleLines, the groups with no multiple lines and no phone service show higher churn rates when TotalCharges is not high.

Finally, in the month-to-month contract group, customers are also prone to churn when TotalCharges is low.

5.3 Will Monthly Charge influence the churn rate with other variables?

#m1 <- ggplot(customer,aes(x = TotalCharges , y = MonthlyCharges , fill = Churn)) +
#  geom_violin(alpha = 0.5, aes(linetype=NA)) +
#  xlab("TotalCharges") + ylab("MonthlyCharges")

m1 <- ggplot(customer, aes(x=MonthlyCharges, y=TotalCharges, color=Churn)) +
  geom_point(size=0.1,alpha=0.4)

m2 <- ggplot(customer,aes(x = MultipleLines , y = MonthlyCharges , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("MultipleLines") + ylab("MonthlyCharges")

m1/m2 + plot_annotation(title = 'MonthlyCharges plot')

First, the two sample groups with the lowest MonthlyCharges show low churn and large customer counts. As MonthlyCharges increases beyond that, churn also starts to increase, concentrated mainly where TotalCharges is still low.

Second, the relationship between MultipleLines and MonthlyCharges leads to the following three observations.

  1. If customers have no multiple-lines service, they are likely to churn when their monthly charge is greater than 75 or between 30 and 50.

  2. If they have no phone service, we find an interesting result: the higher the monthly charge, the fewer customers churn.

  3. If customers have the multiple-lines service, they are likely to churn when their monthly charge is greater than roughly 70.

5.4 Will Tenure influence the churn rate with other variables?

t1 <- ggplot(customer,aes(x = Partner , y = tenure , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Partner") + ylab("Tenure")

t2 <- ggplot(customer,aes(x = Contract , y = tenure , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Contract") + ylab("Tenure")

t3 <- ggplot(customer,aes(x = PaymentMethod , y = tenure , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("PaymentMethod") + ylab("Tenure")

(t1+t2)/t3 + 
  plot_annotation(title = 'Tenure violinplot')

First, customers with no partner and short tenure are more likely to churn.

Second, customers on a month-to-month contract with short tenure are more likely to churn. Among two-year contracts, the few customers who do churn tend to have long tenure.

Finally, the payment method is also a factor: customers who pay by electronic check or mailed check are more likely to churn when their tenure is short.

5.5 Will Contract influence the churn rate with other variables?

Since the four services OnlineSecurity, OnlineBackup, TechSupport, and DeviceProtection, as well as Contract, are categorical variables, a direct scatter plot would show only nine points in the two-dimensional plane. We therefore first encode these variables as numbers and then jitter them (i.e., add small random offsets so that the points spread evenly in the plane).

contract_jitter <- customerNum$Contract*50 +runif(length(customerNum$Contract), -10, 10)
OnlineSecurity_jitter <- customerNum$OnlineSecurity*50 +runif(length(customerNum$OnlineSecurity), -10, 10)
c1 <- ggplot(customer, aes(x=contract_jitter, y=OnlineSecurity_jitter, color=Churn)) +
  geom_point(size=0.01,alpha=0.8)

TechSupport_jitter <- customerNum$TechSupport*50 +runif(length(customerNum$TechSupport), -10, 10)
c2 <- ggplot(customer, aes(x=contract_jitter, y=TechSupport_jitter, color=Churn)) +
  geom_point(size=0.01,alpha=0.8)

OnlineBackupt_jitter <- customerNum$OnlineBackup*50 +runif(length(customerNum$OnlineBackup), -10, 10)
c3 <- ggplot(customer, aes(x=contract_jitter, y=OnlineBackupt_jitter, color=Churn)) +
  geom_point(size=0.01,alpha=0.8)

DeviceProtection_jitter <- customerNum$DeviceProtection*50 +runif(length(customerNum$DeviceProtection), -10, 10)
c4 <- ggplot(customer, aes(x=contract_jitter, y=DeviceProtection_jitter, color=Churn)) +
  geom_point(size=0.01,alpha=0.8)

(c1+c2)/(c3+c4) + plot_annotation(title = 'Contract Plot (with jittering)')

All four variables (OnlineSecurity, TechSupport, OnlineBackup, and DeviceProtection) appear to matter: customers who lack any one of them are more likely to churn when on a month-to-month contract, with DeviceProtection as a partial exception. The plot shows many customers who have DeviceProtection and a month-to-month contract yet still churn, although they are fewer than those without DeviceProtection.

6 Chapter 6: Model Comparison

6.1 Logistic Regression

The first model presented is logistic regression. The dataset has 19 explanatory variables, so the first task is feature selection. We initially tried bestglm to select significant variables, but with more than 15 predictors it was impractical, so we instead used logistic regression and removed variables one at a time. We set alpha = 0.1, meaning we remove any variable with a p-value greater than 0.1. Here is the first table, covering all variables.

logistic_all = glm(Churn ~ gender + SeniorCitizen + Partner + Dependents + tenure
                         + PhoneService + MultipleLines + InternetService + OnlineSecurity
                         + OnlineBackup + DeviceProtection + TechSupport + StreamingTV
                         + StreamingMovies + Contract + PaperlessBilling + PaymentMethod
                         + MonthlyCharges + TotalCharges, data = customer, family = 'binomial')
summary(logistic_all)
xkabledply(logistic_all)

Delete gender

logistic_1 = glm(Churn ~ SeniorCitizen + Partner + Dependents + tenure
                         + PhoneService + MultipleLines + InternetService + OnlineSecurity
                         + OnlineBackup + DeviceProtection + TechSupport + StreamingTV
                         + StreamingMovies + Contract + PaperlessBilling + PaymentMethod
                         + MonthlyCharges + TotalCharges, data = customer, family = 'binomial')
summary(logistic_1)
xkabledply(logistic_1)

Delete Partner

logistic_2 = glm(Churn ~ SeniorCitizen + Dependents + tenure
                         + PhoneService + MultipleLines + InternetService + OnlineSecurity
                         + OnlineBackup + DeviceProtection + TechSupport + StreamingTV
                         + StreamingMovies + Contract + PaperlessBilling + PaymentMethod
                         + MonthlyCharges + TotalCharges, data = customer, family = 'binomial')
summary(logistic_2)
xkabledply(logistic_2)

Delete PhoneService

logistic_3 = glm(Churn ~ SeniorCitizen + Dependents + tenure
                         + MultipleLines + InternetService + OnlineSecurity
                         + OnlineBackup + DeviceProtection + TechSupport + StreamingTV
                         + StreamingMovies + Contract + PaperlessBilling + PaymentMethod
                         + MonthlyCharges + TotalCharges, data = customer, family = 'binomial')
summary(logistic_3)
xkabledply(logistic_3)

Delete MultipleLines

logistic_4 = glm(Churn ~ SeniorCitizen + Dependents + tenure
                         + InternetService + OnlineSecurity
                         + OnlineBackup + DeviceProtection + TechSupport + StreamingTV
                         + StreamingMovies + Contract + PaperlessBilling + PaymentMethod
                         + MonthlyCharges + TotalCharges, data = customer, family = 'binomial')
summary(logistic_4)
xkabledply(logistic_4)

Delete OnlineBackup

logistic_5 = glm(Churn ~ SeniorCitizen + Dependents + tenure
                         + InternetService + OnlineSecurity
                         + DeviceProtection + TechSupport + StreamingTV
                         + StreamingMovies + Contract + PaperlessBilling + PaymentMethod
                         + MonthlyCharges + TotalCharges, data = customer, family = 'binomial')
summary(logistic_5)
xkabledply(logistic_5)

Delete DeviceProtection

logistic_6 = glm(Churn ~ SeniorCitizen + Dependents + tenure
                         + InternetService + OnlineSecurity
                         + TechSupport + StreamingTV
                         + StreamingMovies + Contract + PaperlessBilling + PaymentMethod
                         + MonthlyCharges + TotalCharges, data = customer, family = 'binomial')
summary(logistic_6)
xkabledply(logistic_6)

Delete PaymentMethod

logistic_7 = glm(Churn ~ SeniorCitizen + Dependents + tenure
                         + InternetService + OnlineSecurity
                         + TechSupport + StreamingTV
                         + StreamingMovies + Contract + PaperlessBilling
                         + MonthlyCharges + TotalCharges, data = customer, family = 'binomial')
summary(logistic_7)
xkabledply(logistic_7)

The method is tedious, but fortunately we arrived at the final model after seven iterations. Seven variables with p-values greater than 0.1 were removed from the model: gender, Partner, PhoneService, MultipleLines, OnlineBackup, DeviceProtection, and PaymentMethod. We can now move on to analyzing the model.
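For reference, this kind of backward elimination can also be automated. A minimal sketch using base R's step() (an alternative, not the method used above: step() drops variables by AIC rather than by p-value, so it may keep a slightly different set):

```r
# Alternative, automated backward elimination by AIC (sketch only;
# the manual p-value-based deletions above are the method actually used).
full_model <- glm(Churn ~ . - customerID, data = customer, family = 'binomial')
backward_model <- step(full_model, direction = "backward", trace = 0)
summary(backward_model)
```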

6.1.1 Exponentiated Coefficients

expcoeff1 = exp(coef(logistic_7))
summary(expcoeff1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.24    0.71    0.98    1.16    1.49    3.64       4
xkabledply( as.table(expcoeff1), title = "Exponential of coefficients in Churn" )
Exponential of coefficients in Churn
x
(Intercept) 1.741
SeniorCitizenYes 1.302
DependentsYes 0.850
tenure 0.943
InternetServiceFiber optic 3.639
InternetServiceNo 0.242
OnlineSecurityNo internet service NA
OnlineSecurityYes 0.709
TechSupportNo internet service NA
TechSupportYes 0.718
StreamingTVNo internet service NA
StreamingTVYes 1.535
StreamingMoviesNo internet service NA
StreamingMoviesYes 1.545
ContractOne year 0.482
ContractTwo year 0.238
PaperlessBillingYes 1.454
MonthlyCharges 0.981
TotalCharges 1.000

Among the exponentiated coefficients, five variables have a positive effect on the dependent variable Churn (odds ratio greater than 1) and six have a negative effect (odds ratio less than 1). For example, customers with tech support are more likely to stay than to leave.
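As a concrete reading of the table (a small worked example using two values from above): an exponentiated coefficient is an odds ratio, so values below 1 shrink the odds of churn and values above 1 inflate them.

```r
# Worked example: odds ratios from the table above, expressed as
# percent changes in the odds of churn.
or_techsupport <- 0.718            # TechSupportYes
or_fiber       <- 3.639            # InternetServiceFiber optic
(or_techsupport - 1) * 100         # -28.2: tech support cuts churn odds by ~28%
(or_fiber - 1) * 100               # 263.9: fiber service raises churn odds by ~264%
```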

6.1.2 Confusion matrix

loadPkg("regclass")
confusion_matrix(logistic_7)
##            Predicted No Predicted Yes Total
## Actual No          4619           544  5163
## Actual Yes          836          1033  1869
## Total              5455          1577  7032
xkabledply( confusion_matrix(logistic_7), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted No Predicted Yes Total
Actual No 4619 544 5163
Actual Yes 836 1033 1869
Total 5455 1577 7032
cfmatrix1 = confusion_matrix(logistic_7)
accuracy1 <- (cfmatrix1[1,1]+cfmatrix1[2,2])/cfmatrix1[3,3]
precision1 <- cfmatrix1[2,2]/(cfmatrix1[2,2]+cfmatrix1[1,2])
recall1 <- cfmatrix1[2,2]/(cfmatrix1[2,2]+cfmatrix1[2,1])
specificity1 <- cfmatrix1[1,1]/(cfmatrix1[1,1]+cfmatrix1[1,2])
F1_score1 <- 2*(precision1)*(recall1)/(precision1 + recall1)
accuracy1
precision1
recall1
specificity1
F1_score1

From the confusion matrix, we can conclude that:

  • Accuracy = 0.804
  • Precision = 0.655
  • Recall = 0.553
  • Specificity = 0.895
  • F1 score = 0.600

6.1.3 ROC and AUC

loadPkg("pROC") 
prob=predict(logistic_7, type = "response" )
customer$prob=prob
h = roc(Churn~prob, data=customer)
auc(h) 
## Area under the curve: 0.846
plot(h)

The area under the curve is 0.846. Since it is greater than 0.8, the model is considered a good fit.

6.1.4 McFadden

loadPkg("pscl")
ChurnLogitpr2 = pR2(logistic_7)
## fitting null model for pseudo-r2
ChurnLogitpr2
##       llh   llhNull        G2  McFadden      r2ML      r2CU 
## -2937.855 -4071.678  2267.645     0.278     0.276     0.402
unloadPkg("pscl") 

The McFadden value of 0.278 is analogous to the coefficient of determination \(R^2\): only about 27.8% of the variation in y is explained by the explanatory variables in the model.
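As a quick check using only the two log-likelihoods printed above, McFadden's pseudo-\(R^2\) is defined as \(1 - llh / llh_{null}\):

```r
# Recompute McFadden's pseudo-R^2 from the pR2() output above.
llh      <- -2937.855   # log-likelihood of the fitted model
llh_null <- -4071.678   # log-likelihood of the intercept-only model
mcfadden <- 1 - llh / llh_null
round(mcfadden, 3)      # 0.278, matching the reported value
```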

6.2 KNN

We can use the K-nearest-neighbors algorithm to predict whether customer X will churn by looking at the K neighbors who share the most similar attributes with X.

6.2.1 Center and Scale

fullScale = customer_final
fullScale[c(5, 18, 19)]<- scale(fullScale[c(5, 18, 19)], center = TRUE, scale = TRUE)
str(fullScale)
set.seed(1)
customer_sampe <- sample(2, nrow(fullScale), replace=TRUE, prob=c(0.75, 0.25))
cus_train_full <- fullScale[customer_sampe==1, 1:19]
cus_test_full <- fullScale[customer_sampe==2, 1:19]
# y
cus_train_full_y <- fullScale[customer_sampe==1, 20]
cus_test_full_y <- fullScale[customer_sampe==2, 20]

str(cus_train_full)
str(cus_test_full)

In order to use KNN on our dataset, we first convert all categorical variables to numeric. Notice that the categorical variables now take values in the range [1, 4], while the numerical variables have much wider ranges. This is where scaling is needed. After scaling those numerical variables, our dataset looks like:

str(fullScale)
## 'data.frame':    7032 obs. of  21 variables:
##  $ gender          : num  1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : num  2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : num  1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  -1.2802 0.0643 -1.2394 0.5124 -1.2394 ...
##  $ PhoneService    : num  1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : num  2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : num  1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : num  1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : num  3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: num  1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : num  1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : num  1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : num  1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : num  1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: num  2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : num  3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  -1.162 -0.261 -0.364 -0.748 0.196 ...
##  $ TotalCharges    : num  -0.994 -0.174 -0.96 -0.195 -0.94 ...
##  $ Churn           : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 1 1 2 1 ...
##  $ prob            : num  0.5754 0.0472 0.3689 0.0237 0.6875 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
##   ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...

6.2.2 KNN with all variables

First, let's start modeling with all 19 variables. We make a 3:1 train-test split of the scaled dataset, so the train set has 5185 observations and the test set has 1847 observations.

6.2.3 Select K

What should be our best k value to build KNN model?
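The chooseK helper called below is defined earlier in the report; for completeness, here is a minimal sketch of what such a helper might look like (an assumption about its body, built on class::knn and matching the call signature used here):

```r
library(class)

# Sketch of a chooseK-style helper (hypothetical body): fit KNN for one
# value of k and return that k together with its validation accuracy.
chooseK <- function(k, train_set, val_set, train_class, val_class) {
  pred <- knn(train = train_set, test = val_set, cl = train_class, k = k)
  accu <- mean(pred == val_class)
  c(k = k, accuracy = accu)
}
```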

knn_full_different_k = sapply(seq(1, 21, by = 2),  #<- set k to be odd number from 1 to 21
                         function(x) chooseK(x, 
                                             train_set = cus_train_full,
                                             val_set = cus_test_full,
                                             train_class = cus_train_full_y,
                                             val_class = cus_test_full_y))

str(knn_full_different_k)
knn_full_different_k = data.frame(k = knn_full_different_k[1,],
                             accuracy = knn_full_different_k[2,])

library("ggplot2")

ggplot(knn_full_different_k,
       aes(x = k, y = accuracy)) +
  geom_line(color = "orange", size = 1.5) +
  geom_point(size = 3) + 
  labs(title = "accuracy vs k")

xkabledply((knn_full_different_k))
Model: k ~ accuracy
k accuracy
1 0.720
3 0.752
5 0.772
7 0.781
9 0.781
11 0.775
13 0.780
15 0.787
17 0.787
19 0.787
21 0.786

It seems that k=15 is the best selection here, because it is the smallest k at which accuracy reaches its plateau of 0.787.

pred_full <- knn(train = cus_train_full, test = cus_test_full, cl=cus_train_full_y, k=15)
pred_full

6.2.4 Evaluation

After we build the model, let’s look at its result:

loadPkg("gmodels")
churnPredCross <- CrossTable(cus_test_full_y, pred_full, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1847 
## 
##  
##                 | pred_full 
## cus_test_full_y |         1 |         2 | Row Total | 
## ----------------|-----------|-----------|-----------|
##               1 |      1196 |       169 |      1365 | 
##                 |     0.876 |     0.124 |     0.739 | 
##                 |     0.841 |     0.398 |           | 
##                 |     0.648 |     0.091 |           | 
## ----------------|-----------|-----------|-----------|
##               2 |       226 |       256 |       482 | 
##                 |     0.469 |     0.531 |     0.261 | 
##                 |     0.159 |     0.602 |           | 
##                 |     0.122 |     0.139 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |      1422 |       425 |      1847 | 
##                 |     0.770 |     0.230 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Over the 1847 test observations, this model correctly predicts 1196 customers who do not churn and 256 who do.

  • Accuracy = 0.776
  • Precision = 0.583
  • Recall = 0.498
  • Specificity = 0.874
  • F1 Score = 0.537
    The overall accuracy is 77.6%, which is not bad, but the recall is below 50%, meaning the model correctly identifies fewer than half of the customers who actually churn.
    So could we build a better model?

6.2.5 KNN with selected variables

Let’s consider building our KNN model with fewer variables.

cusKNN <- subset(customer_final, select=c(SeniorCitizen, Partner, Dependents, PaperlessBilling, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, Contract, tenure, MonthlyCharges, Churn))
str(cusKNN)
cus_scale = cusKNN
cus_scale[10:11]<- scale(cusKNN[10:11], center = TRUE, scale = TRUE)
cus_scale$Churn <- customer_final$Churn
str(cus_scale)

We select 11 variables that have a significant effect according to the earlier EDA section:

str(cus_scale)
## 'data.frame':    7032 obs. of  12 variables:
##  $ SeniorCitizen   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : num  2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : num  1 1 1 1 1 1 2 1 1 2 ...
##  $ PaperlessBilling: num  2 1 2 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : num  1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : num  3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: num  1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : num  1 1 1 3 1 1 1 1 3 1 ...
##  $ Contract        : num  1 2 1 2 1 1 1 1 1 2 ...
##  $ tenure          : num  -1.2802 0.0643 -1.2394 0.5124 -1.2394 ...
##  $ MonthlyCharges  : num  -1.162 -0.261 -0.364 -0.748 0.196 ...
##  $ Churn           : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 1 1 2 1 ...
set.seed(1)
customer_sampe <- sample(2, nrow(cus_scale), replace=TRUE, prob=c(0.75, 0.25))
cus_train <- cus_scale[customer_sampe==1, 1:11]
cus_test <- cus_scale[customer_sampe==2, 1:11]
# y
cus_train_y <- cus_scale[customer_sampe==1, 12]
cus_test_y <- cus_scale[customer_sampe==2, 12]
str(cus_train)
str(cus_test)

6.2.6 Select K

Repeat the same scaling and train test split steps as above.
Now select our best k value.

knn_different_k = sapply(seq(1, 21, by = 2),  #<- set k to be odd number from 1 to 21
                         function(x) chooseK(x, 
                                             train_set = cus_train,
                                             val_set = cus_test,
                                             train_class = cus_train_y,
                                             val_class = cus_test_y))

# Reformat the results to graph the results.
str(knn_different_k)
knn_different_k = data.frame(k = knn_different_k[1,],
                             accuracy = knn_different_k[2,])

# Plot accuracy vs. k.
# install.packages("ggplot2")
library("ggplot2")

ggplot(knn_different_k,
       aes(x = k, y = accuracy)) +
  geom_line(color = "orange", size = 1.5) +
  geom_point(size = 3) + 
  labs(title = "accuracy vs k")

xkabledply((knn_different_k))
Model: k ~ accuracy
k accuracy
1 0.747
3 0.768
5 0.781
7 0.779
9 0.782
11 0.788
13 0.784
15 0.787
17 0.782
19 0.786
21 0.781

We select k=11 here, as it yields the highest accuracy.

pred <- knn(train = cus_train, test = cus_test, cl=cus_train_y, k=11)
knn.roc.prob <- attr(knn(train = cus_train, test = cus_test, cl=cus_train_y, k=11,prob = T),'prob')
pred

6.2.7 Evaluation

library("gmodels")
churnPredCross <- CrossTable(cus_test_y, pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1847 
## 
##  
##              | pred 
##   cus_test_y |         1 |         2 | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |      1225 |       140 |      1365 | 
##              |     0.897 |     0.103 |     0.739 | 
##              |     0.829 |     0.378 |           | 
##              |     0.663 |     0.076 |           | 
## -------------|-----------|-----------|-----------|
##            2 |       252 |       230 |       482 | 
##              |     0.523 |     0.477 |     0.261 | 
##              |     0.171 |     0.622 |           | 
##              |     0.136 |     0.125 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1477 |       370 |      1847 | 
##              |     0.800 |     0.200 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
  • Accuracy = 0.791
  • Precision = 0.630
  • Recall = 0.481
  • Specificity = 0.900
  • F1 Score = 0.546
    The overall accuracy improves a bit over the previous KNN model, but the recall does not. One possible reason is that our dataset is imbalanced: there are far more customers who did not churn than customers who did. To address the imbalance, one should explore a better cut-off value.
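A hedged sketch of that cut-off idea, reusing the vote proportions saved above in knn.roc.prob (the 0.35 threshold is purely illustrative, not tuned):

```r
# knn() with prob = TRUE returns the winning class's vote share, so
# flip it for observations predicted as "1" (no churn) to obtain a
# churn probability, then apply a lower, illustrative cut-off.
churn_prob <- ifelse(pred == "2", knn.roc.prob, 1 - knn.roc.prob)
pred_cut <- factor(ifelse(churn_prob >= 0.35, "2", "1"))  # 0.35 is illustrative
table(Actual = cus_test_y, Predicted = pred_cut)
```

Lowering the cut-off trades some specificity for higher recall, which is often the preferable trade when churners are the rare class.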

6.2.8 Comparison

tab <- matrix(c(0.776,0.583,0.498,0.537,0.791,0.630,0.481,0.546,0.804,0.655,0.553,0.6), ncol=4, byrow = TRUE)
colnames(tab) <- c("Accuracy", "Precision", "Recall", "F-1")
rownames(tab) <- c("KNN with all variables", "KNN with selected variables", "Logistic with selected variables")
tab <- as.table(tab)
xkabledply(tab, "Models Comparison")
Models Comparison
Accuracy Precision Recall F-1
KNN with all variables 0.776 0.583 0.498 0.537
KNN with selected variables 0.791 0.630 0.481 0.546
Logistic with selected variables 0.804 0.655 0.553 0.600

A comparison of the two KNN models and the logistic model clearly shows that the logistic model is generally better than KNN in this case, with higher accuracy and recall.

unloadPkg("class")
unloadPkg("gmodels")

6.3 Classification Tree

The third model is a classification tree. A classification tree predicts whether a customer will churn and presents its decision rules visually, which is very convenient. We discuss the performance of this model below.

The first step is cleaning the dataset in preparation for building the model: convert the categorical variables to numeric, and delete customerID, which has no relationship to the churn rate.

customerNum = customer
# convert categorical variable as numeric 
for(i in 2:20){
  # tenure, MonthlyCharges, TotalCharges
  if (!(i %in% c(6, 19, 20))){
    customerNum[,i] = as.numeric(customerNum[,i])
  }
}
customerNum <- subset(customerNum, select = -customerID)

Then we performed feature selection. We ranked feature importance by the mean decrease in Gini across all features. As the plot shows, we chose the top six features to build the classification tree model.

library(randomForest)
fit_im = randomForest(Churn ~ ., data = customerNum)
# Create an importance based on mean decreasing gini
importance(fit_im)
##                  MeanDecreaseGini
## gender                       52.8
## SeniorCitizen                38.1
## Partner                      46.3
## Dependents                   39.0
## tenure                      292.3
## PhoneService                 11.3
## MultipleLines                47.7
## InternetService              40.0
## OnlineSecurity               83.9
## OnlineBackup                 57.6
## DeviceProtection             48.8
## TechSupport                  83.1
## StreamingTV                  32.0
## StreamingMovies              33.4
## Contract                    147.5
## PaperlessBilling             43.6
## PaymentMethod                98.1
## MonthlyCharges              326.9
## TotalCharges                348.4
## prob                        645.1
varImpPlot(fit_im)

Next, we try to find the best depth for this model. We tried different depths and summarize the results.

loadPkg("rpart")
loadPkg("caret")



# create an empty dataframe to store the results from confusion matrices
confusionMatrixResultDf = data.frame( Depth=numeric(0), Accuracy= numeric(0), Sensitivity=numeric(0), Specificity=numeric(0), Pos.Pred.Value=numeric(0), Neg.Pred.Value=numeric(0), Precision=numeric(0), Recall=numeric(0), F1=numeric(0), Prevalence=numeric(0), Detection.Rate=numeric(0), Detection.Prevalence=numeric(0), Balanced.Accuracy=numeric(0), row.names = NULL )

for (deep in 2:8) {
  kfit <- rpart(Churn ~ TotalCharges + MonthlyCharges + tenure + Contract +  OnlineSecurity + PaymentMethod, data=customerNum, method="class", control = list(maxdepth = deep) )
  # 
  cm = confusionMatrix( predict(kfit, type = "class"), reference = customerNum[, "Churn"] ) # from caret library
  # 
  cmaccu = cm$overall['Accuracy']
  # print( paste("Total Accuracy = ", cmaccu ) )
  # 
  cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics 
  cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
  confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
  # print("Other metrics : ")
}

unloadPkg("caret")

As the results show, from depth 5 onward the accuracy is almost the same across depths. Therefore, we chose a maximum depth of 5 to build the tree model.

xkabledply(confusionMatrixResultDf, title="Churn Classification Trees summary with varying MaxDepth")
Churn Classification Trees summary with varying MaxDepth
Depth Accuracy Sensitivity Specificity Pos.Pred.Value Neg.Pred.Value Precision Recall F1 Prevalence Detection.Rate Detection.Prevalence Balanced.Accuracy
2 0.742 0.750 0.719 0.880 0.510 0.880 0.750 0.810 0.734 0.551 0.626 0.735
3 0.781 0.930 0.370 0.803 0.657 0.803 0.930 0.862 0.734 0.683 0.851 0.650
4 0.781 0.930 0.370 0.803 0.657 0.803 0.930 0.862 0.734 0.683 0.851 0.650
5 0.793 0.908 0.473 0.827 0.651 0.827 0.908 0.865 0.734 0.667 0.807 0.691
6 0.793 0.908 0.473 0.827 0.651 0.827 0.908 0.865 0.734 0.667 0.807 0.691
7 0.793 0.908 0.473 0.827 0.651 0.827 0.908 0.865 0.734 0.667 0.807 0.691
8 0.793 0.908 0.473 0.827 0.651 0.827 0.908 0.865 0.734 0.667 0.807 0.691

Then we built the classification tree model and plotted it.

set.seed(1)
Churnfit <- rpart(Churn ~ TotalCharges + MonthlyCharges + tenure + Contract +  OnlineSecurity + PaymentMethod, data=customerNum, method="class", control = list(maxdepth = 5) )

printcp(Churnfit) # display the results 
## 
## Classification tree:
## rpart(formula = Churn ~ TotalCharges + MonthlyCharges + tenure + 
##     Contract + OnlineSecurity + PaymentMethod, data = customerNum, 
##     method = "class", control = list(maxdepth = 5))
## 
## Variables actually used in tree construction:
## [1] Contract       MonthlyCharges OnlineSecurity tenure        
## 
## Root node error: 1869/7032 = 0.3
## 
## n= 7032 
## 
##     CP nsplit rel error xerror xstd
## 1 0.06      0       1.0    1.0 0.02
## 2 0.02      3       0.8    0.8 0.02
## 3 0.01      5       0.8    0.8 0.02
plotcp(Churnfit) # visualize cross-validation results 

summary(Churnfit) # detailed summary of splits
## Call:
## rpart(formula = Churn ~ TotalCharges + MonthlyCharges + tenure + 
##     Contract + OnlineSecurity + PaymentMethod, data = customerNum, 
##     method = "class", control = list(maxdepth = 5))
##   n= 7032 
## 
##       CP nsplit rel error xerror   xstd
## 1 0.0590      0     1.000  1.000 0.0198
## 2 0.0214      3     0.823  0.841 0.0187
## 3 0.0100      5     0.780  0.818 0.0185
## 
## Variable importance
##       Contract         tenure OnlineSecurity   TotalCharges MonthlyCharges 
##             29             21             17             16              9 
##  PaymentMethod 
##              8 
## 
## Node number 1: 7032 observations,    complexity param=0.059
##   predicted class=No   expected loss=0.266  P(node) =1
##     class counts:  5163  1869
##    probabilities: 0.734 0.266 
##   left son=2 (3157 obs) right son=3 (3875 obs)
##   Primary splits:
##       Contract       < 1.5  to the right, improve=449, (0 missing)
##       OnlineSecurity < 1.5  to the right, improve=321, (0 missing)
##       tenure         < 16.5 to the right, improve=295, (0 missing)
##       TotalCharges   < 198  to the right, improve=141, (0 missing)
##       MonthlyCharges < 68.8 to the left,  improve=132, (0 missing)
##   Surrogate splits:
##       tenure         < 34.5 to the right, agree=0.792, adj=0.536, (0 split)
##       OnlineSecurity < 1.5  to the right, agree=0.700, adj=0.332, (0 split)
##       TotalCharges   < 3000 to the right, agree=0.694, adj=0.319, (0 split)
##       PaymentMethod  < 2.5  to the left,  agree=0.665, adj=0.253, (0 split)
##       MonthlyCharges < 27.7 to the left,  agree=0.608, adj=0.127, (0 split)
## 
## Node number 2: 3157 observations
##   predicted class=No   expected loss=0.0678  P(node) =0.449
##     class counts:  2943   214
##    probabilities: 0.932 0.068 
## 
## Node number 3: 3875 observations,    complexity param=0.059
##   predicted class=No   expected loss=0.427  P(node) =0.551
##     class counts:  2220  1655
##    probabilities: 0.573 0.427 
##   left son=6 (1244 obs) right son=7 (2631 obs)
##   Primary splits:
##       OnlineSecurity < 1.5  to the right, improve=114.0, (0 missing)
##       MonthlyCharges < 68.6 to the left,  improve=104.0, (0 missing)
##       tenure         < 5.5  to the right, improve= 72.1, (0 missing)
##       TotalCharges   < 104  to the right, improve= 30.9, (0 missing)
##       PaymentMethod  < 3.5  to the right, improve= 28.8, (0 missing)
##   Surrogate splits:
##       MonthlyCharges < 24.1 to the left,  agree=0.804, adj=0.389, (0 split)
##       TotalCharges   < 24.1 to the left,  agree=0.719, adj=0.124, (0 split)
##       PaymentMethod  < 3.5  to the right, agree=0.692, adj=0.041, (0 split)
##       tenure         < 69.5 to the right, agree=0.679, adj=0.001, (0 split)
## 
## Node number 6: 1244 observations
##   predicted class=No   expected loss=0.251  P(node) =0.177
##     class counts:   932   312
##    probabilities: 0.749 0.251 
## 
## Node number 7: 2631 observations,    complexity param=0.059
##   predicted class=Yes  expected loss=0.49  P(node) =0.374
##     class counts:  1288  1343
##    probabilities: 0.490 0.510 
##   left son=14 (1580 obs) right son=15 (1051 obs)
##   Primary splits:
##       tenure         < 7.5  to the right, improve=75.7, (0 missing)
##       TotalCharges   < 348  to the right, improve=57.6, (0 missing)
##       MonthlyCharges < 69.2 to the left,  improve=41.8, (0 missing)
##       PaymentMethod  < 2.5  to the left,  improve=23.3, (0 missing)
##   Surrogate splits:
##       TotalCharges   < 548  to the right, agree=0.948, adj=0.871, (0 split)
##       MonthlyCharges < 75.2 to the right, agree=0.639, adj=0.095, (0 split)
##       PaymentMethod  < 3.5  to the left,  agree=0.633, adj=0.081, (0 split)
## 
## Node number 14: 1580 observations,    complexity param=0.0214
##   predicted class=No   expected loss=0.413  P(node) =0.225
##     class counts:   928   652
##    probabilities: 0.587 0.413 
##   left son=28 (505 obs) right son=29 (1075 obs)
##   Primary splits:
##       MonthlyCharges < 73.8 to the left,  improve=39.5, (0 missing)
##       tenure         < 16.5 to the right, improve=11.3, (0 missing)
##       PaymentMethod  < 2.5  to the left,  improve=10.7, (0 missing)
##       TotalCharges   < 660  to the left,  improve= 5.7, (0 missing)
##   Surrogate splits:
##       TotalCharges  < 741  to the left,  agree=0.770, adj=0.279, (0 split)
##       PaymentMethod < 3.5  to the right, agree=0.712, adj=0.099, (0 split)
## 
## Node number 15: 1051 observations
##   predicted class=Yes  expected loss=0.343  P(node) =0.149
##     class counts:   360   691
##    probabilities: 0.343 0.657 
## 
## Node number 28: 505 observations
##   predicted class=No   expected loss=0.25  P(node) =0.0718
##     class counts:   379   126
##    probabilities: 0.750 0.250 
## 
## Node number 29: 1075 observations,    complexity param=0.0214
##   predicted class=No   expected loss=0.489  P(node) =0.153
##     class counts:   549   526
##    probabilities: 0.511 0.489 
##   left son=58 (767 obs) right son=59 (308 obs)
##   Primary splits:
##       tenure         < 16.5 to the right, improve=17.10, (0 missing)
##       TotalCharges   < 1540 to the right, improve=16.10, (0 missing)
##       PaymentMethod  < 2.5  to the left,  improve= 9.82, (0 missing)
##       MonthlyCharges < 96   to the left,  improve= 4.47, (0 missing)
##   Surrogate splits:
##       TotalCharges   < 1450 to the right, agree=0.967, adj=0.886, (0 split)
##       MonthlyCharges < 75.3 to the right, agree=0.718, adj=0.016, (0 split)
## 
## Node number 58: 767 observations
##   predicted class=No   expected loss=0.433  P(node) =0.109
##     class counts:   435   332
##    probabilities: 0.567 0.433 
## 
## Node number 59: 308 observations
##   predicted class=Yes  expected loss=0.37  P(node) =0.0438
##     class counts:   114   194
##    probabilities: 0.370 0.630
# plot tree 
plot(Churnfit, uniform=TRUE, main="Classification Tree for Churn")
text(Churnfit, use.n=TRUE, all=TRUE, cex=.8)

# create an attractive PostScript plot of the tree
post(Churnfit, file = "ChurnTree2.ps", title = "Classification Tree for Churn")

6.3.1 Results and confusion matrix

And these are the summary results of this model and its confusion matrix.

loadPkg("caret") 
cm = confusionMatrix( predict(Churnfit, type = "class") , reference = customerNum[, "Churn"])
print('Overall: ')
## [1] "Overall: "
cm$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##       7.93e-01       4.18e-01       7.83e-01       8.02e-01       7.34e-01 
## AccuracyPValue  McnemarPValue 
##       2.84e-30       1.54e-40
print('Class: ')
## [1] "Class: "
cm$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##                0.908                0.474                0.827 
##       Neg Pred Value            Precision               Recall 
##                0.651                0.827                0.908 
##                   F1           Prevalence       Detection Rate 
##                0.865                0.734                0.667 
## Detection Prevalence    Balanced Accuracy 
##                0.807                0.691
unloadPkg("caret")
xkabledply(cm$table, "confusion matrix")
confusion matrix
No Yes
No 4689 984
Yes 474 885

6.3.2 Tree plot

We also tried two other ways to plot the tree: the library rpart.plot, and a “fancy” plot with the library rattle that makes the tree plot more attractive.

loadPkg("rpart.plot")
rpart.plot(Churnfit)

loadPkg("rattle") 
fancyRpartPlot(Churnfit)

We could also prune the tree.

#prune the tree 
Churnfit <- prune(Churnfit, cp = Churnfit$cptable[2,"CP"])

# plot the pruned tree 
fancyRpartPlot(Churnfit)

# For boring plot, use codes below instead
plot(Churnfit, uniform=TRUE, main="Pruned Classification Tree for Churn")
text(Churnfit, use.n=TRUE, all=TRUE, cex=.8)

6.3.3 ROC curve

Finally, we also checked the ROC and AUC of this classification tree model. As we can see, the model is reasonably good.

library(ROCR)
pred <- prediction(predict(Churnfit, type = "prob")[, 2], customerNum$Churn)
tree.predict.prob <- predict(Churnfit, type = "prob")[, 2]
plot(performance(pred, "tpr", "fpr"), main = "ROC Churn")
auc = performance(pred, 'auc')
slot(auc, 'y.values')
## [[1]]
## [1] 0.793
abline(0, 1, lty = 2)

6.4 SVM

Firstly, we need to load the package for SVM model training.

# install.packages("tidyverse")
# install.packages("kernlab")
# install.packages("e1071")
# install.packages("caTools")
library(tidyverse)#SVM
library(kernlab)#SVM
library(e1071)#SVM
library(caTools)#Split Data into Test and Train Set
library(ModelMetrics)#confusion matrix
library(ROCR)#ROC plot

Then, we set up 75% of the data as the training set; the rest is the test set.

dat = subset(customer, select = -c(customerID))
set.seed(123)
split <- sample.split(dat$Churn, SplitRatio = 0.75)  # split on the label vector
# split
train_svm <- subset(dat, split == TRUE)
test_svm <- subset(dat, split == FALSE)

In this case, we decided to use the tune function from the e1071 package to train the SVM models. There are four kernel types we can use: radial, linear, polynomial, and sigmoid. Running these models takes a lot of time, so we reduced the size of the training sample; to save running time and check the process quickly, we used only 200 observations here to demonstrate our workflow.

6.4.1 SVM with radial kernel

The best parameters:

# tune model to find optimal cost, gamma values
tune.out <- tune(svm, Churn~., data = train_svm[1:200,], kernel = "radial",
                 ranges = list(cost = c(0.1,1,10,100,1000),
                 gamma = c(0.5,1,2,3,4)))
# show best model
tune.out$best.model
## 
## Call:
## best.tune(method = svm, train.x = Churn ~ ., data = train_svm[1:200, 
##     ], ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5, 
##     1, 2, 3, 4)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  186

The parameters of the best SVM model with the radial basis kernel are:

tune.out$best.parameters
##   cost gamma
## 3   10   0.5
best_svmfit <- svm(Churn~., data = train_svm, kernel = "radial", gamma = 0.5, cost = 10, probability = TRUE)

6.4.2 SVM with linear kernel

The best parameters:

# tune model to find optimal cost, gamma values
tune.out.linear <- tune(svm, Churn~., data = train_svm[1:200,], kernel = "linear",
                 ranges = list(cost = c(0.1,1,10,100,1000)))
# show best model
tune.out.linear$best.model
## 
## Call:
## best.tune(method = svm, train.x = Churn ~ ., data = train_svm[1:200, 
##     ], ranges = list(cost = c(0.1, 1, 10, 100, 1000)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  78

The parameters of the best SVM model with the linear kernel are:

tune.out.linear$best.parameters
##   cost
## 3   10
best_svm_linear_fit <- svm(Churn~., data = train_svm, kernel = "linear", cost = 10,probability = TRUE)

6.4.3 SVM with polynomial kernel

The best parameters:

# tune model to find optimal cost, gamma values
tune.out.polynomial <- tune(svm, Churn~., data = train_svm[1:200,], kernel = "polynomial",
                 ranges = list(cost = c(0.1,1,10,100,1000),
                               gamma = c(0.5,1,2,3,4)))
# show best model
tune.out.polynomial$best.model
## 
## Call:
## best.tune(method = svm, train.x = Churn ~ ., data = train_svm[1:200, 
##     ], ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5, 
##     1, 2, 3, 4)), kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  0.1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  91

The parameters of the best SVM model with the polynomial kernel are:

tune.out.polynomial$best.parameters
##   cost gamma
## 1  0.1   0.5
best_svm_polynomial_fit <- svm(Churn~., data = train_svm, kernel = "polynomial", gamma = 0.5, cost = 0.1, probability = TRUE)

6.4.4 SVM with sigmoid kernel

The best parameters:

# tune model to find optimal cost, gamma values
tune.out.sigmoid <- tune(svm, Churn~., data = train_svm[1:200,], kernel = "sigmoid",
                 ranges = list(cost = c(0.1,1,10,100,1000),
                               gamma = c(0.5,1,2,3,4)))
# show best model
tune.out.sigmoid$best.model
## 
## Call:
## best.tune(method = svm, train.x = Churn ~ ., data = train_svm[1:200, 
##     ], ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5, 
##     1, 2, 3, 4)), kernel = "sigmoid")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  sigmoid 
##        cost:  0.1 
##      coef.0:  0 
## 
## Number of Support Vectors:  110

The parameters of the best SVM model with the sigmoid kernel are:

tune.out.sigmoid$best.parameters
##   cost gamma
## 1  0.1   0.5
best_svm_sigmoid_fit <- svm(Churn~., data = train_svm, kernel = "sigmoid", gamma = 0.5, cost = 0.1,probability = TRUE)

6.4.5 Evaluation for SVM models

After training the best model for each kernel, we use the test set to calculate the confusion matrix and evaluation scores such as accuracy, recall, F1 score, and so on.

Below are the confusion matrices of these SVM models; the output following each confusion matrix shows that model's evaluation scores.
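All of these scores derive from the 2×2 confusion matrix. As a sanity check, a minimal sketch in Python reproduces the radial-kernel test-set scores from the counts reported below (note that caret treats the first factor level, "No", as the positive class):

```python
# Confusion-matrix counts for the radial-kernel SVM on the test set
# (truth in rows, prediction in columns; "No" is caret's positive class)
tp, fn = 1342, 136   # true "No" predicted as No / Yes
fp, tn = 263, 267    # true "Yes" predicted as No / Yes

accuracy  = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)          # Pos Pred Value
recall    = tp / (tp + fn)          # Sensitivity
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# → 0.801 0.836 0.908 0.871
```

These values match the Accuracy, Precision, Recall, and F1 that caret reports for the radial kernel below.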

SVM with radial kernel:

library(gmodels)
c.radial <- CrossTable(test_svm$Churn, predict(best_svmfit,test_svm), prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2008 
## 
##  
##                | predict(best_svmfit, test_svm) 
## test_svm$Churn |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |      1342 |       136 |      1478 | 
##                |     0.908 |     0.092 |     0.736 | 
##                |     0.836 |     0.337 |           | 
##                |     0.668 |     0.068 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |       263 |       267 |       530 | 
##                |     0.496 |     0.504 |     0.264 | 
##                |     0.164 |     0.663 |           | 
##                |     0.131 |     0.133 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |      1605 |       403 |      2008 | 
##                |     0.799 |     0.201 |           | 
## ---------------|-----------|-----------|-----------|
## 
## 
# validate model performance
# valid <- table(true = test_svm$Churn, pred = predict(best_svmfit,test_svm))
# valid
#method 2(confusion matrix)
loadPkg("caret")

cm_radial = confusionMatrix( predict(best_svmfit,test_svm), reference = test_svm$Churn )
print('Overall of SVM radial kernel: ')
## [1] "Overall of SVM radial kernel: "
cm_radial$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##       8.01e-01       4.46e-01       7.83e-01       8.19e-01       7.36e-01 
## AccuracyPValue  McnemarPValue 
##       5.10e-12       2.83e-10
print('Class of SVM radial kernel: ')
## [1] "Class of SVM radial kernel: "
cm_radial$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##                0.908                0.504                0.836 
##       Neg Pred Value            Precision               Recall 
##                0.663                0.836                0.908 
##                   F1           Prevalence       Detection Rate 
##                0.871                0.736                0.668 
## Detection Prevalence    Balanced Accuracy 
##                0.799                0.706
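The Kappa value reported above can also be checked by hand: it compares the observed accuracy with the accuracy expected from the row and column totals alone. A quick sketch in Python using the same radial-kernel counts:

```python
# Cohen's kappa for the radial-kernel SVM, from the test-set confusion matrix
# row totals 1478/530 (truth), column totals 1605/403 (prediction)
n = 2008
observed = (1342 + 267) / n                                  # correct predictions
expected = (1478 / n) * (1605 / n) + (530 / n) * (403 / n)   # chance agreement
kappa = (observed - expected) / (1 - expected)
print(round(kappa, 3))
# → 0.446
```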

SVM with linear kernel:

c.linear <- CrossTable(test_svm$Churn, predict(best_svm_linear_fit,test_svm), prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2008 
## 
##  
##                | predict(best_svm_linear_fit, test_svm) 
## test_svm$Churn |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |      1320 |       158 |      1478 | 
##                |     0.893 |     0.107 |     0.736 | 
##                |     0.854 |     0.341 |           | 
##                |     0.657 |     0.079 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |       225 |       305 |       530 | 
##                |     0.425 |     0.575 |     0.264 | 
##                |     0.146 |     0.659 |           | 
##                |     0.112 |     0.152 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |      1545 |       463 |      2008 | 
##                |     0.769 |     0.231 |           | 
## ---------------|-----------|-----------|-----------|
## 
## 
cm_linear = confusionMatrix( predict(best_svm_linear_fit,test_svm), reference = test_svm$Churn )
print('Overall of SVM linear kernel: ')
## [1] "Overall of SVM linear kernel: "
cm_linear$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##       8.09e-01       4.88e-01       7.91e-01       8.26e-01       7.36e-01 
## AccuracyPValue  McnemarPValue 
##       8.54e-15       7.45e-04
print('Class of SVM linear kernel: ')
## [1] "Class of SVM linear kernel: "
cm_linear$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##                0.893                0.575                0.854 
##       Neg Pred Value            Precision               Recall 
##                0.659                0.854                0.893 
##                   F1           Prevalence       Detection Rate 
##                0.873                0.736                0.657 
## Detection Prevalence    Balanced Accuracy 
##                0.769                0.734

SVM with polynomial kernel:

c.polynomial <- CrossTable(test_svm$Churn, predict(best_svm_polynomial_fit,test_svm), prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2008 
## 
##  
##                | predict(best_svm_polynomial_fit, test_svm) 
## test_svm$Churn |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |      1369 |       109 |      1478 | 
##                |     0.926 |     0.074 |     0.736 | 
##                |     0.834 |     0.298 |           | 
##                |     0.682 |     0.054 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |       273 |       257 |       530 | 
##                |     0.515 |     0.485 |     0.264 | 
##                |     0.166 |     0.702 |           | 
##                |     0.136 |     0.128 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |      1642 |       366 |      2008 | 
##                |     0.818 |     0.182 |           | 
## ---------------|-----------|-----------|-----------|
## 
## 
# note: evaluated with the model tuned on the 200-row subset, so the scores
# below differ from the CrossTable above (which uses best_svm_polynomial_fit)
cm_polynomial = confusionMatrix( predict(tune.out.polynomial$best.model,test_svm), reference = test_svm$Churn )
print('Overall of SVM polynomial kernel: ')
## [1] "Overall of SVM polynomial kernel: "
cm_polynomial$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##        0.76345        0.38861        0.74423        0.78189        0.73606 
## AccuracyPValue  McnemarPValue 
##        0.00266        0.78309
print('Class of SVM polynomial kernel: ')
## [1] "Class of SVM polynomial kernel: "
cm_polynomial$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##                0.842                0.545                0.838 
##       Neg Pred Value            Precision               Recall 
##                0.553                0.838                0.842 
##                   F1           Prevalence       Detection Rate 
##                0.840                0.736                0.620 
## Detection Prevalence    Balanced Accuracy 
##                0.740                0.693

SVM with sigmoid kernel:

c.sigmoid <- CrossTable(test_svm$Churn, predict(best_svm_sigmoid_fit,test_svm), prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2008 
## 
##  
##                | predict(best_svm_sigmoid_fit, test_svm) 
## test_svm$Churn |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |      1098 |       380 |      1478 | 
##                |     0.743 |     0.257 |     0.736 | 
##                |     0.753 |     0.691 |           | 
##                |     0.547 |     0.189 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |       360 |       170 |       530 | 
##                |     0.679 |     0.321 |     0.264 | 
##                |     0.247 |     0.309 |           | 
##                |     0.179 |     0.085 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |      1458 |       550 |      2008 | 
##                |     0.726 |     0.274 |           | 
## ---------------|-----------|-----------|-----------|
## 
## 
# note: evaluated with the model tuned on the 200-row subset, so the scores
# below differ from the CrossTable above (which uses best_svm_sigmoid_fit)
cm_sigmoid = confusionMatrix( predict(tune.out.sigmoid$best.model,test_svm), reference = test_svm$Churn )
print('Overall of SVM sigmoid kernel: ')
## [1] "Overall of SVM sigmoid kernel: "
cm_sigmoid$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##       7.33e-01       1.06e-01       7.13e-01       7.52e-01       7.36e-01 
## AccuracyPValue  McnemarPValue 
##       6.49e-01       1.97e-60
print('Class of SVM sigmoid kernel: ')
## [1] "Class of SVM sigmoid kernel: "
cm_sigmoid$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##                0.947                0.134                0.753 
##       Neg Pred Value            Precision               Recall 
##                0.477                0.753                0.947 
##                   F1           Prevalence       Detection Rate 
##                0.839                0.736                0.697 
## Detection Prevalence    Balanced Accuracy 
##                0.926                0.541
unloadPkg("caret")

Finally, we plot ROC plot for comparison among the models we built above.

x.svm.linear.prob <- predict(best_svm_linear_fit, type="prob", newdata=test_svm, probability = TRUE)
x.svm.linear.prob.rocr <- prediction(attr(x.svm.linear.prob, "probabilities")[,2], test_svm$Churn)
x.svm.linear.perf <- performance(x.svm.linear.prob.rocr, "tpr","fpr")
plot(x.svm.linear.perf, col=4)

# x.svm.prob <- predict(best_svmfit, type="prob", newdata=test_svm, probability = TRUE)
# x.svm.prob.rocr <- prediction(attr(x.svm.prob, "probabilities")[,2], test_svm$Churn)
# x.svm.perf <- performance(x.svm.prob.rocr, "tpr","fpr")
# plot(x.svm.perf, col=5, add=TRUE)

# x.svm.polynomial.prob <- predict(best_svm_polynomial_fit, type="prob", newdata=test_svm, probability = TRUE)
# x.svm.polynomial.prob.rocr <- prediction(attr(x.svm.polynomial.prob, "probabilities")[,2], test_svm$Churn)
# x.svm.polynomial.perf <- performance(x.svm.polynomial.prob.rocr, "tpr","fpr")
# plot(x.svm.polynomial.perf, col=6, add=TRUE)

# x.svm.sigmoid.prob <- predict(best_svm_sigmoid_fit, type="prob", newdata=test_svm, probability = TRUE)
# x.svm.sigmoid.prob.rocr <- prediction(attr(x.svm.sigmoid.prob, "probabilities")[,2], test_svm$Churn)
# x.svm.sigmoid.perf <- performance(x.svm.sigmoid.prob.rocr, "tpr","fpr")
# plot(x.svm.sigmoid.perf, col=7, add=TRUE)

# Draw a legend.
# legend(0.7, 0.3, c( 'logistic','KNN','Classification Tree','svm linear'), 1:4) #with KNN
legend(0.7, 0.3, c('logistic','Classification Tree','svm linear'), c(1,3,4)) # without KNN

#logistic
x.glm.prob <- predict(logistic_7, type = "response" )
x.glm.prob.rocr <- prediction(x.glm.prob, customer$Churn)
x.glm.perf <- performance(x.glm.prob.rocr, "tpr","fpr")
plot(x.glm.perf, col=1, add=TRUE)

#tree
x.tree.prob <- prediction(tree.predict.prob, customerNum$Churn)
plot(performance(x.tree.prob, "tpr", "fpr"), col=3, add=TRUE)

#knn
# x.knn.prob <- prediction(knn.roc.prob, cus_test_y)
# plot(performance(x.knn.prob, "tpr", "fpr"), col=2, add=TRUE)
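An ROC curve is simply the set of (FPR, TPR) pairs obtained by sweeping a classification threshold over the predicted churn probabilities; ROCR's prediction()/performance() pair performs exactly this sweep. A minimal illustration in Python with hypothetical scores and labels:

```python
# ROC points by sweeping a threshold over predicted churn scores
# (hypothetical scores/labels; ROCR's prediction()/performance() does this sweep)
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]   # 1 = churned ("Yes")

pos = sum(labels)
neg = len(labels) - pos
points = []
for t in sorted(set(scores), reverse=True):
    # classify as churned whenever the score reaches the threshold t
    tpr = sum(l for s, l in zip(scores, labels) if s >= t) / pos      # true positive rate
    fpr = sum(1 - l for s, l in zip(scores, labels) if s >= t) / neg  # false positive rate
    points.append((fpr, tpr))
print(points)  # both coordinates increase as the threshold drops
```

Plotting these points in threshold order traces the ROC curve from (0, 0) toward (1, 1).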
tab <- matrix(c( 0.806,0.539,0.672,0.598,0.791,0.630,0.481,0.546,0.804,0.655,0.553,0.6, 0.793,0.474,0.651,0.549), ncol=4, byrow = TRUE)
colnames(tab) <- c("Accuracy", "Precision", "Recall", "F-1")
rownames(tab) <- c("SVM", "KNN with selected variables", "Logistic with selected variables","Classification Tree with selected variables")
tab <- as.table(tab)
xkabledply(tab, "Models Comparison")
Models Comparison

                                              Accuracy   Precision   Recall   F-1
SVM                                              0.806       0.539    0.672   0.598
KNN with selected variables                      0.791       0.630    0.481   0.546
Logistic with selected variables                 0.804       0.655    0.553   0.600
Classification Tree with selected variables      0.793       0.474    0.651   0.549

Even though the SVM model achieved the highest accuracy among these models, we still choose the logistic regression model as our best classification model because of its performance on the ROC plot and its interpretability.
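As a quick consistency check, the F-1 column in the comparison table follows from its Precision and Recall columns (the table treats churn, "Yes", as the positive class, as described in the conclusion). For the SVM row, in Python:

```python
# F-1 is the harmonic mean of precision and recall (SVM row of the table)
precision, recall = 0.539, 0.672
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
# → 0.598
```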

7 Chapter 7: Conclusion

7.1 Question 1: What are the factors that influence attrition? What are the characteristics of the churn-prone population?

  1. The factors that influence churn are: the customer's age, marital status, whether there are dependents, phone service, internet service provider, internet add-on services, contract type, billing method, payment method, tenure (length of time on the network), and monthly charges.

  2. The characteristics of the churn-prone group are: older customers, single, no dependents, phone service, fiber-optic internet, no internet add-on services, month-to-month contracts, paperless billing, electronic payment, monthly charges of $70-110, and tenure of 0-6 months.

  3. Combined with the above analysis, subscriber churn at telecommunication companies is mainly due to two reasons:

  • The network services the company provides, especially the fiber-optic service, may give users a poor experience; the churn rate among these users is high.
  • The company's overall high churn rate is driven by its large number of new subscribers; churn is most severe among subscribers who have been on the network for 6 months or less.

7.2 Question 2: Retention suggestions for high churn users

  1. Improve services: strengthen the basic network services and enhance the user experience.

  2. Implement preferential programs.

  • Offer a referral discount to all users: for each new user a customer introduces to the company, provide a certain fee reduction. This acquires new users while also increasing the loyalty of existing users.
  • Provide a fee waiver or other free services to each new user, to increase the retention rate of new users.
  • Use discount programs to move users from month-to-month contracts to yearly contracts.

  3. Provide simple and convenient services for elderly users to reduce the loss of this user group.

  4. Improve the payment process, especially the electronic check channel.

7.3 Question 3: Construct a churn classifier and output the probability that it is a churned user

From the test results, the SVM model has the highest accuracy at 80.6%, meaning that 80.6% of the samples are labeled correctly. Its recall of 67.2% indicates that the users flagged as churned account for 67.2% of the actual churned users; in other words, 32.8% of potential churners are missed, and more churned-user data would be needed to train the classifier and improve recall. The validation set performs better than the test set in terms of recall, suggesting the model generalizes well.
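One inexpensive complement to collecting more churned-user data is to lower the decision threshold on the predicted churn probabilities (the SVM fits above already return them via probability = TRUE): recall rises at the cost of precision. A sketch of the trade-off with hypothetical predicted probabilities:

```python
# Precision/recall trade-off as the churn decision threshold is lowered
# (hypothetical predicted probabilities; 1 = churned)
scores = [0.9, 0.7, 0.6, 0.45, 0.4, 0.2, 0.1]
labels = [1,   1,   0,   1,    0,   1,   0]

def pr(threshold):
    pred = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(pred, labels))        # churners caught
    fp = sum(p and not l for p, l in zip(pred, labels))    # false alarms
    fn = sum(not p and l for p, l in zip(pred, labels))    # churners missed
    return tp / (tp + fp), tp / (tp + fn)                  # (precision, recall)

print(pr(0.5))   # default threshold
print(pr(0.3))   # lower threshold: recall rises, precision falls
```

For a retention campaign, where missing a churner usually costs more than contacting a loyal customer, a lower threshold is often the right trade.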